Abstract. Most evolutionary approaches to fault recovery in FPGAs 
focus on evolving alternative logic configurations as opposed to evolving 
the intra-cell routing. Since the majority of transistors in a typical FPGA 
are dedicated to interconnect, nearly 80% according to one estimate, 
evolutionary fault-recovery systems should benefit by accommodating 
routing. In this paper, we propose an evolutionary fault-recovery system 
employing a genetic representation that takes into account both logic and 
routing configurations. Experiments were run using a software model of 
the Xilinx Virtex FPGA. We report that using four Virtex configurable 
logic blocks (CLBs), we were able to evolve a 100% accurate quadrature decoder 
finite state machine in the presence of a stuck-at-zero fault. 


1 Introduction 

Numerous advantages of Field Programmable Gate Arrays (FPGAs) in space- 
borne electronics have been identified in recent research publications [3, 15] and 
manufacturers’ literature [1, 16]. Benefits include reconfiguration capability to 
support multiple missions, the ability to correct latent design errors after launch, 
and the potential to accommodate on-chip and off-chip failures. Ground Support 
Equipment (GSE) based FPGA applications primarily employ reprogrammable 
devices as a means of amortizing development costs over multiple missions. In 
GSE-enabled applications such as Reusable Launch Vehicles (RLVs), FPGAs 
are configured or replaced between missions rather than being reprogrammed 
during flight. For applications such as RLVs, comparatively short mission dura- 
tions and low levels of ionizing radiation are involved. Hence for many ground 
reconfigurable applications, conventional Triple Modular Redundancy (TMR) 
techniques often provide sufficient fault handling coverage. 

On the other hand, in-mission reconfigurable FPGAs are advantageous for 
deep space probes, satellites, and extraterrestrial rovers. In these applications, 
the radiation exposures, mission durations, and repair complexities are signifi- 
cantly greater. The need for adequate fault coverage during these missions has 
become further intensified by the increasing number of FPGAs being deployed. 



For instance, NASA’s Stardust probe contains over 100 FPGA devices. Although 
the Stardust’s FPGAs are based on a non-reprogrammable antifuse-based 
technology, a more recent space-qualified SRAM-based technology has become 
commercially available. 

In SRAM-based devices, the number of programming cycles is unlimited. 
Hence new techniques become feasible for active recovery through reconfigura- 
tion of a compromised FPGA. The approach developed here concentrates on 
autonomous reconfiguration of SRAM-based devices while in-flight. The 
experiments conducted involve Xilinx’s SRAM-based Virtex devices, which belong 
to the same family as the space-qualified QPRO radiation-hardened series. 

Permanent Single-Event Latchup (SEL) failures may impact CLBs and/or 
programmable interconnections within the FPGA itself. They may also involve 
other supporting devices that the FPGA interfaces with or processes data from. 
These failure modes also suggest that the ability to derive an alternative FPGA 
configuration in-situ would be beneficial. Likewise, SEL exposures exist with re- 
gards to the data processing path within the FPGA that is not involved with the 
device’s programmable configuration. In the above cases, the FPGA configura- 
tion derived at design time will no longer provide the required functionality for 
the damaged part. Traditionally, redundant spares have been utilized to replace 
the damaged device. 

Autonomous repair can work in concert with or provide an alternative to 
device redundancy. While redundant spares exist only in limited quantities, 
evolutionary recovery methods attempt to facilitate repair through reuse of 
damaged parts. Hence the potential benefits are twofold. First, one or more 
failures might be accommodated by reconfiguring the failed part without 
incurring the increased weight, size, or power traditionally associated with 
providing redundant spares. Second, the characteristics of the failure need not 
be precisely diagnosed in order to be repaired. Here the repair is performed in-situ via intrinsic 
evaluation of the device’s remaining functionality. This implies that any resid- 
ual functionality, including the electrical characteristics of both the damaged 
device and its interaction with any supporting devices, is taken into account 
when realizing the repair. After isolating the fault to a size that is manageable 
for the evolutionary algorithm, alternate solutions are refined through iterative 
selection. This can be carried out without detailed knowledge of the underlying 
failure mechanism itself. 

The approach developed here attempts to regain lost functionality due to 
a fault by evolving a new configuration on the defective FPGA. We assume a 
dual-redundant FPGA system whereby the faulty FPGA undergoes evolution 
to recover its functionality while the redundant FPGA maintains proper func- 
tionality during evolution on the faulty FPGA. Thus after a fault is detected, 
redundancy is lost for a short period of time and then restored. Application 
functionality is maintained throughout this process under the assumption that 
only one of the FPGAs fails. Our results show that the evolutionary methods are 
able to fully recover from a simulated stuck-at-zero fault in the input of a state 



machine implementing a quadrature decoder. Several research challenges remain 
and they are also discussed. 

2 Related Work 

Recently, various evolutionary algorithm approaches have been proposed for 
fault-recovery of FPGAs. Some previous work applies evolutionary algorithms 
prior to the occurrence of the fault while other approaches attempt to repair the 
fault after its occurrence. Some techniques involve intrinsic evolution using the 
failed part itself. Others rely on extrinsic evolution of an abstracted model of 
the devices. 

Three examples of recent work that apply evolutionary algorithms to realize 
fault-tolerant designs include [11], [4], and [13]. In [11], Miller examined 
properties of messy gates whereby evolved logic functions inherently contain redundant 
terms as their functional boundaries change and overlap. In [4], Canham and 
Tyrrell compare the fault tolerance of oscillators evolved by including a range of 
fault conditions within the fitness measure during the evolutionary process. A 
population-based approach scores evolved designs using a fitness function corre- 
sponding to desired operation based on the absence of faults. When evolution is 
complete, an additional pass evaluates the ability of the evolved individuals to 
tolerate a range of faults, and the most fault-tolerant individuals are retained. 
In [13], the evolution of designs containing redundant capabilities without the 
designer having to explicitly specify the redundant parts themselves was inves- 
tigated. To achieve this, a range of fault cases was introduced throughout the 
evolution process. This allowed individuals to exploit whatever component be- 
haviors exist, even behaviors known to be faulty. 

An evolutionary fault-recovery approach is described by Vigander [14]. He 
develops a genetic algorithm to restore functionality after random faults are 
injected into a 4-bit by 4-bit multiplier using standard genetic operators. He 
simulated the repair of the prior-designed multiplier that consisted of a 
feed-forward interconnection of hypothetical FPGA cells capable of 8 different 
logic functions. He used as his fitness function the number of correct 
input-output mappings from the 256 possible input combinations that could be 
applied to the multiplier. He demonstrated that while it is not exceedingly 
difficult to derive a solution that can produce a nearly correct repair, 
completely correct repairs present a challenging problem. To remedy this, he 
demonstrated that a voting system with as few as three alternatively evolved 
repaired circuits was capable of producing a 
majority output that was completely correct. 

3 Representation and Operators 

Several goals were taken into account while designing the representation scheme. 
Amenability to recombination is of course a primary concern. After that, our 
priorities were to let the GA work in the largest, most flexible design space 
possible: we wanted to allow all possible LUT configurations and allow the 



maximum number of CLB interconnections given the constraints of hardware 
routing support (we will say more about the routing at the end of this section). 
We also wanted to disallow illegal configurations and to minimize non-coding 
alleles (introns). 

Bitstring representations are a natural choice for FPGA applications, and 
many times the raw configuration string can be used as the representation. In 
our case, we chose a bitstring representation mainly out of convenience in pro- 
gramming. Since we knew that only a handful of CLBs would be evolved, our 
bitstrings would be at most 1000 bits long. We acknowledge that this approach 
would likely suffer as more CLBs were utilized and the corresponding bitstring 
enlarged to thousands of bits. 

The representation is shown in Figure 1. This scheme is composed of multiple 
128-bit fields, one for each CLB. Within each CLB field are a number of 
sub-fields that specify each of the LUT bits and remote connections. There are 16 
bits that specify the contents of each LUT. Each LUT has four inputs, and 
since each of these inputs can be connected to other LUT outputs, the remote 
CLB/LUT requires addressing bits. Since our system is composed of four 
CLBs, we need only two bits to specify the remote CLB, and another two bits to 
specify the particular LUT within the CLB. Each LUT therefore occupies 
16 + 4 x (2 + 2) = 32 bits, and four such LUT sub-fields fill the 128-bit CLB 
field. This pattern of sub-fields continues 
for each LUT until all the LUTs in the CLB are accounted for. An illustration 
of the CLBs, LUTs and sample routing is shown in Figure 2. 
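
The field layout described above can be illustrated with a short decoding routine. The following is only a sketch under stated assumptions: the ordering of sub-fields within each LUT (truth table first, then one CLB/LUT address pair per input) and the big-endian reading of the address bits are assumptions made for illustration, and the names `LutGene` and `decode_genome` are hypothetical rather than taken from the actual system.

```python
from dataclasses import dataclass
from typing import List, Tuple

LUTS_PER_CLB = 4        # implied by 128 bits per CLB / 32 bits per LUT
INPUTS_PER_LUT = 4
LUT_TABLE_BITS = 16     # contents of one 4-input LUT
ADDR_BITS = 2           # 2 bits select the CLB, 2 bits select the LUT
LUT_FIELD_BITS = LUT_TABLE_BITS + INPUTS_PER_LUT * 2 * ADDR_BITS   # 32
CLB_FIELD_BITS = LUTS_PER_CLB * LUT_FIELD_BITS                     # 128


@dataclass
class LutGene:
    table: List[int]                # 16 truth-table bits
    sources: List[Tuple[int, int]]  # (remote CLB, remote LUT) for each input


def bits_to_int(bits: List[int]) -> int:
    """Interpret a list of bits as an unsigned integer (assumed MSB first)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode_genome(genome: List[int], num_clbs: int = 4) -> List[List[LutGene]]:
    """Split a flat bitstring into per-CLB, per-LUT logic and routing fields."""
    if len(genome) != num_clbs * CLB_FIELD_BITS:
        raise ValueError("genome length does not match the assumed layout")
    clbs = []
    pos = 0
    for _ in range(num_clbs):
        luts = []
        for _ in range(LUTS_PER_CLB):
            table = genome[pos:pos + LUT_TABLE_BITS]
            pos += LUT_TABLE_BITS
            sources = []
            for _ in range(INPUTS_PER_LUT):
                clb = bits_to_int(genome[pos:pos + ADDR_BITS])
                lut = bits_to_int(genome[pos + ADDR_BITS:pos + 2 * ADDR_BITS])
                pos += 2 * ADDR_BITS
                sources.append((clb, lut))
            luts.append(LutGene(table, sources))
        clbs.append(luts)
    return clbs
```

Because every two-bit address is a legal CLB or LUT index, any bitstring of the right length decodes to a structurally valid configuration under this layout, which is in keeping with the stated goal of disallowing illegal configurations.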
Fig. 1. Genetic representation used showing logic fields and routing fields. 



Fig. 2. Example of routing among CLBs. 




The operators employed were crossover and mutation. Two-point crossover 
was implemented using cut points allowed between bits. Mutation was applied 
on individual bits. 
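
The two operators can be sketched as follows. This is a minimal illustration, not the ECJ implementation used in the experiments: the function names are hypothetical, genomes are assumed to be Python lists of 0/1 integers, and the cut points are assumed to be drawn uniformly from all positions between bits (including the two ends).

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two randomly chosen cut points."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    # Cut points fall between bits, so n bits offer n + 1 positions.
    first, second = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:first] + parent_b[first:second] + parent_a[second:]
    child_b = parent_b[:first] + parent_a[first:second] + parent_b[second:]
    return child_a, child_b


def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in genome]
```

Since cut points may fall anywhere, a crossover can split a LUT truth table or a routing address between the two parents; the representation tolerates this because every resulting bitstring still decodes to a legal configuration.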

Regarding the routing, it is chosen automatically by the JBits software. Our 
circuits are sufficiently small that we have never experienced a situation where 
a route could not be found. Successful routes have been found routing 1 LUT 
output to 48 different inputs (the maximum number of inputs available in a 
2-by-2 circuit, where 1 CLB is dedicated to external inputs). It is theoretically 
possible that in some larger designs the routing will become so dense that some 
routes would not be found. If such a case were to occur, the specific route would 
simply not get connected. Such an individual would then most likely receive a low 
fitness score, and be automatically eliminated from the gene pool. 

4 Fault Recovery Of Quadrature Decoder 

The quadrature decoder [2] was selected as an initial case study for testing and 
refinement of our evolutionary recovery strategy. It represents a NASA 
application of manageable size that is appropriate for tuning of the GA. Quadrature 
decoders provide a means of counting objects passed back and forth through 
two beams of light, or alternatively determining the angular displacement and 
direction of rotation of an encoder wheel turning about its axis. A quadrature 
decoder that determines the direction of rotation of a shaft is shown in Figure 3. 



Fig. 3. Rotating shaft application for a quadrature decoder. 


The concept of operation for the quadrature decoder is that the objects, or 
opaque arcs on the rotating wheel, to be counted will first obscure and then 
move past the two light beams in succession. The order in which the beams are 
cleared can be used to ascertain the direction of rotation. The use of two beams 
acts to preclude false counts due to jitter or bounce resulting from multiple 
phantom reads. For example, to have a valid increment in the rotational count, 
both beams must be cleared in succession. 

To implement the decoder, it is possible to employ a state machine that keeps 
track of the beam activity. The state machine accepts two single-bit inputs which 



are asserted only when the corresponding sensor is obscured. When a change of 
the inputs occurs, the state machine transitions to its next internal state. The 
state machine is asynchronous and outputs a zero bit if the wheel is rotating in 
one direction, and a one bit if the wheel is rotating in the opposite direction. If 
the wheel is not rotating, the output is the same as it previously was. The finite 
state machine for the quadrature decoder is shown in Figure 4. 
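
The behavior described in this paragraph can be captured in a small reference model. The sketch below is an assumption-laden illustration rather than a transcription of the state machine in Figure 4: it assumes the two sensor inputs follow the usual Gray-code sequence 00, 01, 11, 10 in one direction, it arbitrarily maps that direction to output 0, and it holds the previous output when both inputs change at once. The name `decode_direction` is hypothetical.

```python
# Order of (A, B) sensor states for one direction of rotation.
# Assumption: standard Gray-code quadrature sequence, mapped to output 0.
FORWARD_SEQUENCE = [(0, 0), (0, 1), (1, 1), (1, 0)]


def decode_direction(samples, initial_output=0):
    """Return one direction bit per input sample.

    The output is 0 for one direction and 1 for the other, and it keeps
    its previous value while the inputs do not change.
    """
    outputs = []
    output = initial_output
    previous = None
    for state in samples:
        if previous is not None and state != previous:
            step = (FORWARD_SEQUENCE.index(state)
                    - FORWARD_SEQUENCE.index(previous)) % 4
            if step == 1:
                output = 0   # advanced one position in the assumed sequence
            elif step == 3:
                output = 1   # moved one position backwards
            # step == 2 means both inputs changed together; hold the output
        outputs.append(output)
        previous = state
    return outputs
```

A model of this kind could serve to generate the expected output stream against which evolved circuits are scored, although the source does not state how its reference outputs were produced.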
Fig. 4. Quadrature decoder finite state machine. 


5 Experimental Setup and Results 

The software system used is depicted in Figure 5. The entire system is 
implemented in software. The GA software is ECJ, a Java-based evolutionary 
computation and genetic programming system by Sean Luke of George Mason 
University. ECJ is augmented by our code for tasks like decoding individuals and 
calculating fitness. The GA sits on top of Xilinx Corporation’s JBits software [5, 
8], a set of Java classes which provide an Application Programming Interface 
to access the Xilinx FPGA bitstream. Xilinx’s Virtex DS software, which 
simulates the operation of Virtex devices, is used to test candidate solutions. 
Borland’s JBuilder Java environment is used for development and to run the system, 
though Sun Microsystems’ Java virtual machine is used beneath JBuilder. 

To evaluate the fitness of an individual, an input stream of 500 bit pairs is 
used. These inputs attempt to fully exercise the evolving finite state machines. 
The quad decoder inputs are supplied at specified clock intervals, which have 
nothing to do with how often the finite state machine changes state. The quad 
decoder inputs can change at any frequency below the frequency of the decoder 
clock - they might never change (if the wheel is standing still), or they might 
change at every clock cycle (if the wheel is rotating at the maximum allowed 
velocity). 





Fig. 5. Software system. 


The output stream consists of 510 bits sampled across all four CLBs. Ten 
bits have been added to allow for the delay (in clock cycles) between the time the 
decoder inputs are fed into the decoder, and the time the output of the decoder 
is read. Such an arrangement allows us to read the output of the decoder from 1 
to 11 clock cycles after the inputs have been fed into the circuit. Adding ten 
bits gives ten output stream windows of length 500, with each output stream 
shifted by 1 bit from the next. Sampling across all the CLBs allows the GA 
maximum flexibility in building the FSM. Thus, fitness is expressed as: 

$$F = \max_{i=1,\dots,4;\; j=0,\dots,9} \mathrm{CLB}_{i,j}$$

where $\mathrm{CLB}_{i,j}$ represents the number of correct output bits from the ith CLB 
shifted by j clock ticks. The fitness is simply the highest number of correct 
output bits seen across all of the CLBs and across the ten output windows. The 
best score is 500, and the worst score is 0. 
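
The fitness measure can be written out directly. The sketch below assumes that the sampled outputs are already available as one list of bits per CLB (510 bits each) and that the expected output is a list of 500 bits; the function name `fitness` and its argument layout are hypothetical, and the interaction with the simulator is omitted.

```python
def fitness(clb_outputs, expected, num_shifts=10):
    """Best match count over all CLB output streams and all window shifts.

    clb_outputs: one sampled bit stream per CLB, each at least
                 len(expected) + num_shifts bits long.
    expected:    the correct decoder output for the test input stream.
    """
    window = len(expected)
    best = 0
    for stream in clb_outputs:
        if len(stream) < window + num_shifts:
            raise ValueError("output stream too short for the requested shifts")
        for shift in range(num_shifts):          # j = 0 .. num_shifts - 1
            candidate = stream[shift:shift + window]
            correct = sum(out == exp for out, exp in zip(candidate, expected))
            best = max(best, correct)
    return best
```

Taking the maximum over shifts means an individual is not penalized for the latency of its circuit, and taking the maximum over CLBs means the GA is free to place the output on any of the four blocks.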

The genetic algorithm was set up as shown in Table 1. Small population 
sizes were necessary since an unfixable memory leak was present in one of the 
pre-compiled modules. 


Number of generations    1000
Population size          40
Tournament Size          4
Elitist Individuals      2
Gen 0 Seeding            20 individuals
Crossover rate           0.8
Mutation rate            0.002 per bit

Table 1. GA parameters. 
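
To show how the selection parameters in Table 1 fit together, the following sketch produces one new generation using tournament selection with elitism. It is an illustration only and does not reproduce ECJ's breeding pipeline: the function names are hypothetical, `vary` is an assumed callback standing in for crossover (rate 0.8) and per-bit mutation (rate 0.002), and the seeding of generation 0 is omitted.

```python
import random


def tournament_select(population, fitnesses, size, rng):
    """Pick `size` distinct individuals at random and return the fittest."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda idx: fitnesses[idx])
    return population[winner]


def next_generation(population, fitnesses, vary, rng,
                    tournament_size=4, elites=2):
    """Build a new population of the same size as the old one.

    vary(parent_a, parent_b, rng) must return two new child genomes.
    """
    ranked = sorted(range(len(population)),
                    key=lambda idx: fitnesses[idx], reverse=True)
    # Elitism: copy the best individuals unchanged.
    new_population = [list(population[idx]) for idx in ranked[:elites]]
    while len(new_population) < len(population):
        parent_a = tournament_select(population, fitnesses, tournament_size, rng)
        parent_b = tournament_select(population, fitnesses, tournament_size, rng)
        for child in vary(parent_a, parent_b, rng):
            if len(new_population) < len(population):
                new_population.append(child)
    return new_population
```

With a population of 40 and 2 elites, 38 new individuals are bred per generation in this sketch, so the best fitness found so far can never decrease from one generation to the next.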


Approximately 10 experimental runs were conducted using smaller input 
bitstreams of 100 bit pairs. These were found to evolve finite state machines that 
were tuned to the test cases, but not robust when interrogated with 
out-of-sample input test streams. Two runs were conducted using 500 bit pairs and one 
of these runs was able to evolve a 100% accurate quadrature decoder finite state 
machine in the presence of an induced fault. The location of the fault was chosen 
at random, although we made sure that it would adversely affect the function- 
ality of the seeded circuits. Once the fault is present, we assume that it does 
not get removed (however, if it does, our algorithm can start evolving the circuit 
configuration again). We assumed that the circuit is operating properly prior to 
the fault, and the evolution is started once the fault is detected. 

The best evolved configuration was found in generation 623 and is shown in 
Figure 6. Two of the 16 LUTs went unused which is not surprising given that 
the FSM can be implemented with about 10 LUTs. The GA exploits the induced 
fault to its advantage: if the fault is removed from the evolved solution, 
the circuit no longer functions correctly and achieves an accuracy of only 93.8%. Also, 
note that the input LUTs had mostly zeros in their tables. This is because we 
fix most of those bits to zero in the genome since they do not affect the LUT’s 
function. However, the “corner” bits of each of those input LUTs are involved 
in processing the input, and therefore, are evolved. 

The GA performance curve for this run is shown in Figure 7. The run ramps 
up quickly, showing that useful search is underway; however, the average fitness 
is stagnant for about 300 generations, which is not encouraging. The runs are 
quite slow to execute on a 2 GHz Pentium 4 PC. Runtimes were about 45 hours 
since each evaluation takes approximately 6 seconds. 


6 Discussion 


Evolutionary systems for fault recovery on FPGAs may be an important tool in 
the quest for ever-higher levels of fault tolerance in NASA missions and other 
applications. We have demonstrated a system that is able to evolve a realistic 
spacecraft control function in the presence of a permanent stuck-at fault. Using 
a software simulation of an FPGA, we constructed a genetic representation that 
included both logic and routing information, and ran a genetic algorithm to 
evolve a quadrature decoder. As is typical in evolutionary algorithm applications, 
the evolved solution exploits its resources in unexpected ways. In our case, the 
algorithm made use of the fault itself in constructing its solution. If there is 
economy to be gained by exploiting damaged resources, that is certainly a benefit 
largely unique to evolutionary search. 

Potential advantages of this approach are handling a wider range of errors, 
and relaxing the requirement of fault location/isolation. An autonomous fault 
recovery system would be possible if the evolution could be done at sub-second 
speeds. Future work includes investigation of scalability to more complex logic 
functions and systems that have multiple induced faults. Speeding up the eval- 
uation cycle by doing evolution directly in hardware is our next line of research. 




Fig. 6. Evolved configuration showing routing, LUT contents, and simulated fault. 
Inputs are on the lines labeled MSB and LSB, referring to the most/least significant bit 
of the input (channel A and B inputs). Wires that are shown crossing perpendicularly 
(e.g., +) are unconnected; only wires that have T junctions are connected. 